To impute the harmonized dataset, we followed the next steps:
The following graph shows folder tree of our project.
In the “0_Documentation” subfolder are saved the current Rmarkdown code along with its html version. Inside the “1.Input_data” subfolder are placed the data resulting from the harmonization process in csv format and a xlsx file with the list of proposed variables to include in the imputation model.
The “2_Code” subfolder contains the original R code for the data processing, for imputation process (including the codes to sent to HPC computer and to merge the resulting iterations) and convergence analysis. Relevant output files and graphs of our job are saved on the “3_Output_data” and “4_Graphs” subfolders.
We run all codes in R version 4.1.0 on a macOS computer and on a linux based High performance computer.
rm(list=ls()) # clean environment
# Data manipulation packages
library(here)
library(data.table)
library(dplyr)
# Imputation packages
library(mitml)
library(mice)
library(micemd)
library(miceadds)
# Graphic packages
library(corrplot)
library(ggplot2)
library(ggtext)
library(plotly)
# Field specific package
library(growthstandards)
By using the “here” package, it is possible to define relative paths in the project folder. Therefore the user only needs to update the input files in the respective subfolder and run the code without require any further modification.
data <- as.data.table(read.csv(here('1_Input_data','ALLDATASET_preliminary.csv')))
index <- as.data.table(readxl::read_xlsx(here('1_Input_data','Index.xlsx'),sheet="Sheet1"))
#It specifies the variables to include according to Lauren and Anneke criteria.
nindex <- as.data.table(readxl::read_xlsx(here('1_Input_data','Index.xlsx'),sheet="Sheet2")) #It specifies the order of the variables included in the imputation model
index <- index[Bkey%in%c(1,2)] # Index of variables proposed by Lauren and Anneke.
vary <- c("studyname",index[,Variable]) # Get a vector with the proposed variables.
data <- as.data.table(data[,..vary]) # Subset the dataset with the proposed variables.
Initially we check the data set to identify possible typing errors and outliers. We separately check the tree group of input variables: Outcome, Exposure, Pregnant woman.
In order to apply the imputation methods on the incomplete database, we need to set all unspecified values (coded as 666, 777, 888, 999) to NA. R does not handle NA values in the same way as SAS, and specifically the mouse package only imputes variables with NA values. Before converting to NA values, we need to check if there is any NA specification that can be used to retrieve information from a particular variable. For example the 777 level of the arb_cling variable is used to further define the CZS variable.
data[, arb_clindiag := ifelse(arb_clindiag==777,6,arb_clindiag)] # We set the NA value ("777") to a level "6" to use it in the creation of CZS variable.
Then with the following instruction, we can replace all the NA values for all variables.
data[data == 666] <- NA
data[data == 777] <- NA
data[data == 888] <- NA
data[data == 999] <- NA
data[data == 9999] <- NA
char <- which(sapply(data, is.character)) #wwhich variable are categorical
datac <- data[,..char]
# Transform observation with string format
data[, inf_head_circ_age_fu1 := as.numeric(sub(" .*", "",data$inf_head_circ_age_fu1))] # remove years at the end of the string
data[, zikv_elisa_ga_1 := as.numeric(sub(" .*", "",data$zikv_elisa_ga_1))] # remove weeks at the end of the string
data[ zikv_pcr_date_1 %in% c("22027","888",""), zikv_pcr_date_1 := NA] # set as NA values not given in date format
data[ zikv_elisa_date_1 %in% c(""),zikv_elisa_date_1 := NA] #set blank strings as NA
data[ symp_date%in%c(""), symp_date:=NA]
#unique(data$othabnorm_spec) # 39 categories
data[,othabnorm_spec := NULL] # We temporal remove this variable from the dataset as contains messy information: 39 categories in total
data[inf_weight < 100, inf_weight := inf_weight*100] #Correct 2 measures with a misplaced period
data[inf_weight > 10000, inf_weight := inf_weight/10]#Correct 1 measure with an extra zero
data[,inf_sex := ifelse(inf_sex==3,NA,inf_sex)] # Assing 3 to NA, as this level is not defined
data[inf_length == 0,inf_length := NA] #Assign to NA all negative lengths
data[inf_head_circ_birth == 0,inf_head_circ_birth := NA] #Assign to NA all circumference with 0 value
data[,endga := ifelse(!is.na(endga),endga,birth_ga)] #end_ga with NA values can be replaced with observed birth_ga values.
data[fet_us_micro_tri2 == 1|fet_us_micro_tri3 == 1, microcephaly_bin := 1] # Microcephaly binary can be defined from fet_us_micro_tri2 and fet_us_micro_tri3 values
data[loss_etiology == 4,loss_etiology := NA] # Level 4 not defined for loss_etiology
By checking the inclusion criteria of each study it was found that the following studies only included women with detected zika infection (Source: Lauren code).
Therefore, one could assign the zika prevalence (zikv_prev = 1) for all women belonging to these studies. Note: According to protocol (Anneke suggestion), that the following studies applied also the same inclusion criteria.
However by inspecting the data we found that on these countries there are a high proportion of women with a negative zika test (zikv_preg = 0 and in the following table defined as zikv_neg).
zik_incA<-c("Brazil_RiodeJaneiro_CunhaPrata","Brazil_SP_RibeiraoPreto_Duarte","Brazil_BahiaPaudaLima_Costa","Spain_Soriano","FrenchGuiana_Pomar","TrinidadTobago_Sohan","Colombia_Mulkey")
data[, .(count = .N,
ziv_pos=sum(ifelse(zikv_preg==1,1,0),na.rm=TRUE),
ziv_neg=sum(ifelse(zikv_preg==0,1,0),na.rm=TRUE),
ziv_na=sum(ifelse(is.na(zikv_preg),1,0),na.rm=TRUE),
inclusion=ifelse(studyname%in%zik_incA,1,0))
, by = studyname]
## studyname count ziv_pos ziv_neg ziv_na inclusion
## 1: Brazil_RiodeJaneiro_CunhaPrata 54 1 4 49 1
## 2: Brazil_SP_RibeiraoPreto_Duarte 513 513 0 0 1
## 3: Brazil_BahiaPaudaLima_Costa 46 13 33 0 1
## 4: Spain_Soriano 278 27 206 45 1
## 5: FrenchGuiana_Pomar 291 291 0 0 1
## 6: TrinidadTobago_Sohan 104 83 3 18 1
## 7: Brazil_RiodeJaneiro_Joao 219 219 0 0 0
## 8: Spain_Bardaji 198 4 194 0 0
## 9: Colombia_Mulkey 85 77 0 8 1
## 10: USA_Mulkey 95 61 17 17 0
For the moment, we have chosen to apply only the inclusion criteria in the studies specified by Lauren.
zik_inc<-c("Brazil_RiodeJaneiro_CunhaPrata","Brazil_SP_RibeiraoPreto_Duarte","TrinidadTobago_Sohan")
data[studyname%in%zik_inc,zik_preg:=1]
Negative values of the following variables were assinged to NA
data[ miscarriage_ga<=0, miscarriage_ga:=NA]# codebook assigned to 666
data[ loss_ga<=0, loss_ga := NA]# codebook assigned to 666
data[ birth_ga<=0, birth_ga:=NA]# codebook assigned to 666
data[ zikv_pcr_ga_1<=0,zikv_pcr_ga_1:=NA] # codebook assigned to 666
data[ zikv_elisa_ga_1<=0,zikv_elisa_ga_1:=NA] # codebook assigned to 666
data[ zikv_ga<=0,zikv_ga:=NA]# codebook assigned to 666
data[ symp_ga<=0,symp_ga:=NA]# codebook assigned to 666
data[ arb_clindiag_ga<=0,arb_clindiag_ga:=NA]# codebook assigned to 666
Observations without a defined date of zika test, but with a date for pcr or elisa test were assigned with the earlier date of the available test dates.
data[, zikv_ga_min:=apply(data[,c("zikv_elisa_ga_1","zikv_pcr_ga_1")], 1, min, na.rm = TRUE)]
data[, zikv_ga_min:=ifelse(is.infinite(zikv_ga_min),NA,zikv_ga_min)]
data[, zikv_gan:=ifelse(is.na(zikv_ga),zikv_ga_min,zikv_ga_min)]
data[, zikv_gan:=ifelse(is.na(zikv_ga),zikv_ga_min,zikv_ga_min)]
data[, zikv_ga_min:=NULL]
data[, zikv_trin:=ifelse(is.na(zikv_gan),zikv_tri,ifelse(zikv_gan<13,0,ifelse(zikv_gan<=27,1,2)))]
data[age<14, age:=NA] # ask "Spain_Soriano" for this age records, for the moment assigned to NA
data[tobacco==3,tobacco:=NA] # level 3 not defined for tobacco variable
@ Anneke could you please add the information related to this variable?
data[!is.na(inf_sex), hcircm2zscore := as.numeric(igb_hcircm2zscore(gagebrth = birth_ga*7,
hcircm=inf_head_circ_birth,
sex=ifelse(inf_sex== 0, "Male","Female")))]
data[, microcephaly_temp := ifelse(hcircm2zscore<=-3,2,
ifelse(hcircm2zscore<=-2,1,
ifelse(hcircm2zscore<=2,0,
ifelse(!is.na(hcircm2zscore),3,NA))))]
data[ is.na(microcephaly), microcephaly := microcephaly_temp]
check_microcephaly <- as.data.table(table(mic_bin=data$microcephaly_bin,
micro=data$microcephaly,
micro_temp=data$microcephaly_temp,
useNA = "always"))
check_microcephaly[N>0] #check this inconsistency.. we priorirized microcephaly variable over microcephaly_bin
## mic_bin micro micro_temp N
## 1: 0 0 0 491
## 2: 1 0 0 10
## 3: <NA> 0 0 86
## 4: 1 1 0 3
## 5: 0 1 1 1
## 6: 1 1 1 8
## 7: <NA> 1 1 2
## 8: 0 2 2 3
## 9: 1 2 2 9
## 10: <NA> 2 2 1
## 11: 0 0 3 49
## 12: <NA> 3 3 16
## 13: 0 0 <NA> 628
## 14: 1 0 <NA> 3
## 15: 0 1 <NA> 2
## 16: 1 1 <NA> 32
## 17: 0 2 <NA> 1
## 18: 1 2 <NA> 10
## 19: 0 3 <NA> 7
## 20: 0 <NA> <NA> 162
## 21: 1 <NA> <NA> 8
## 22: <NA> <NA> <NA> 351
## mic_bin micro micro_temp N
data[,microcephaly_bin:=ifelse(microcephaly%in%c(0,3),0,ifelse(microcephaly%in%c(1,2),1,microcephaly_bin))]
For facilitate the construction of anormality variables we created the following function
#Function returns 1 if there is any anomaly detected and 0 if there were no anomaly detected across the recorded variables , NA no recorded variables.
checkcon<-function(data,col1){ #
data$anyT <- rowSums(data[, .SD, .SDcols = col1], na.rm=T)
data$allNA <- rowSums(!is.na(data[, .SD, .SDcols = col1]))
data$Final<- ifelse(data$allNA==0,NA,ifelse(data$anyT>0,1,0))
return(data$Final)
}
data[,modificated1:=ifelse(fet_us_abn_spec_tri1==0,1,NA)]
data[,modificated2:=ifelse(fet_us_abn_spec_tri2==0,1,NA)]
data[,modificated3:=ifelse(fet_us_abn_spec_tri3==0,1,NA)]
col1<-c("modificated1","modificated2","modificated3","hydrocephaly","calcifications","ventriculomegaly","fet_us_cns_tri2","fet_us_cns_tri3")
data[,neuroabnormality:=checkcon(data=data,col1=col1)]
data[,modificated1:=ifelse(fet_us_abn_spec_tri1==1,1,NA)]
data[,modificated2:=ifelse(fet_us_abn_spec_tri2==1,1,NA)]
data[,modificated3:=ifelse(fet_us_abn_spec_tri3==1,1,NA)]
col1<-c("modificated1","modificated2","modificated3","fet_us_msk_tri2","fet_us_msk_tri3")
data[,contractures:=checkcon(data=data,col1=col1)]
data[,modificated1:=ifelse(fet_us_abn_spec_tri1==2,1,NA)]
data[,modificated2:=ifelse(fet_us_abn_spec_tri2==2,1,NA)]
data[,modificated3:=ifelse(fet_us_abn_spec_tri3==2,1,NA)]
col1<-c("modificated1","modificated2","modificated3","fet_us_cardio_tri2","fet_us_cardio_tri3")
data[,cardioabnormality:=checkcon(data=data,col1=col1)]
data[,modificated1:=ifelse(fet_us_abn_spec_tri1==3,1,NA)]
data[,modificated2:=ifelse(fet_us_abn_spec_tri2==3,1,NA)]
data[,modificated3:=ifelse(fet_us_abn_spec_tri3==3,1,NA)]
col1<-c("modificated1","modificated2","modificated3","fet_us_gastro_tri2","fet_us_gastro_tri3")
data[,gastroabnormality:=checkcon(data=data,col1=col1)]
data[,modificated1:=ifelse(fet_us_abn_spec_tri1==4,1,NA)]
data[,modificated2:=ifelse(fet_us_abn_spec_tri2==4,1,NA)]
data[,modificated3:=ifelse(fet_us_abn_spec_tri3==4,1,NA)]
col1<-c("modificated1","modificated2","modificated3","fet_us_orofac_tri2","fet_us_orofac_tri3")
data[,oroabnormality:=checkcon(data=data,col1=col1)]
data[,modificated1:=ifelse(fet_us_abn_spec_tri1==5,1,NA)]
data[,modificated2:=ifelse(fet_us_abn_spec_tri2==5,1,NA)]
data[,modificated3:=ifelse(fet_us_abn_spec_tri3==5,1,NA)]
col1<-c("modificated1","modificated2","modificated3","fet_us_eyeear_tri2","fet_us_eyeear_tri3")
data[,ocularabnormality:=checkcon(data=data,col1=col1)]
data[,modificated1:=ifelse(fet_us_abn_spec_tri1==6,1,NA)]
data[,modificated2:=ifelse(fet_us_abn_spec_tri2==6,1,NA)]
data[,modificated3:=ifelse(fet_us_abn_spec_tri3==6,1,NA)]
col1<-c("modificated1","modificated2","modificated3","fet_us_genur_tri2","fet_us_genur_tri3")
data[,genurabnormality:=checkcon(data=data,col1=col1)]
col1<-c("neuroabnormality","contractures","cardioabnormality","gastroabnormality","oroabnormality","ocularabnormality","genurabnormality","othabnorm","fet_us_bin_tri1","fet_us_bin_tri2","fet_us_bin_tri3")
data[,anyabnormality_czs:=checkcon(data=data,col1=col1)]
col1<-c(col1,"anyabnormality_czs")
WHO definition for CZS: Presence of confirmed maternal or fetal ZIKV infection AND presence of severe microcephaly AND presence of other malformations (including limb contractures, high muscle tone, eye abnormalities, and hearing loss, nose etc.)
data[, czs2 := ifelse((data$zikv_preg==1 | data$fet_zikv==1) &
data$microcephaly==2 & data$anyabnormality_czs==1, 1,
ifelse( data$zikv_preg==0 & data$fet_zikv==0 &
data$microcephaly!=2 & data$anyabnormality_czs==0,
0,NA))]
data[, czsn := ifelse(is.na(czs),czs2,czs)]
data[, czs2 := NULL]
checktable <- data.table(table(czs=data$czs,czsn=data$czsn,useNA = "always"))
checktable[ N>0 ] #Combinations where the czv calculated differs from the given one. czsn gave information of around 100 observations
## czs czsn N
## 1: 0 0 790
## 2: <NA> 0 93
## 3: 1 1 40
## 4: <NA> 1 3
## 5: <NA> <NA> 957
The original dataset contains variables that specify whether the pregnancy stage ends with the birth or death of the baby. When the baby dies the event is classified as miscarriage, loss depending on the time at which the pregnancy ends. In addition, related variables such as pregnancy termination, loss etiology or induced abortion are specified. Many of these variables have a high proportion of missing values which can be avoided by combining them into a time variable and an overall event indicator. In other words, instead of having separate indicator variables for death, miscarriage and lost and time variables associated with each of them, one can have an indicator variable, “bdeath”, that specifies whether the baby was born (bdeath=0) or died (bdeath=1), with a gestational time variable that indicates when the event occurred (bdeath_ga).
We start by combining the information of the variables miscarriage, loss, loss_etiology, birth.
data[,birth:=ifelse(!is.na(birth_ga),1,NA)] #birth indicator from bithga
data[,birth2:=ifelse(!is.na(inf_term),1,NA)] #birth indicator from inf_term
data[,bdeath:=ifelse(birth==1|birth2==1,0,NA)]
data[,bdeath_ga:=ifelse(!is.na(endga),endga,ifelse(!is.na(loss_ga),loss_ga,ifelse(!is.na(miscarriage_ga),miscarriage_ga,NA)))]
Then we used the etiology variable to add information on NA observations or ratify information in other pregancy term variables
The following data.table instruction allows us to assign the values on misscarriage, loss, loss_etiology, birth and bdeath in one step given a loss etiology condition.
#Etiology levels (0=live, 1=miscarriage, 2=loss, 3=imp death)
data[loss_etiology==0&(is.na(bdeath)|bdeath==0),c("miscarriage","loss","loss_etiology","birth","bdeath"):= list(0,0,0,1,0)] #birth
data[loss_etiology==1&(is.na(bdeath)|bdeath==1),c("miscarriage","loss","loss_etiology","birth","bdeath"):= list(1,0,1,0,1)] #miscarriage
data[loss_etiology==2&(is.na(bdeath)|bdeath==1),c("miscarriage","loss","loss_etiology","birth","bdeath"):= list(0,1,2,0,1)] #loss
data[loss_etiology==3&(is.na(bdeath)|bdeath==1),c("miscarriage","loss","loss_etiology","birth","bdeath"):= list(0,1,2,0,1)] #class as loss
Here indicators of bdeath are ratified with information of bdeath_ga
data[bdeath==1&bdeath_ga<20,c("miscarriage","loss","loss_etiology","birth","bdeath"):= list(1,0,1,0,1)]
data[bdeath==1&bdeath_ga>=20,c("miscarriage","loss","loss_etiology","birth","bdeath"):= list(0,1,2,0,1)]
Observations with loss =1 and birth=1, allows us to specify other variables
data[loss==1&is.na(loss_etiology),c("miscarriage","loss","loss_etiology","birth","bdeath"):= list(0,1,2,0,1)]
data[birth==1&is.na(loss_etiology),c("miscarriage","loss","loss_etiology","birth","bdeath"):= list(0,0,0,1,0)]
Then we used inf_vital status to ratify information or correct contradictory observations
data[miscarriage==0&loss==0&is.na(birth)&inf_vital_status==0,c("miscarriage","loss","loss_etiology","birth","bdeath"):= list(0,0,0,1,0)]
data[birth==1&loss_etiology==1&inf_vital_status==0,c("miscarriage","loss","loss_etiology","birth","bdeath"):= list(0,0,0,1,0)]
data[birth==0&inf_vital_status==0,c("miscarriage","loss","loss_etiology","birth","bdeath"):= list(0,0,0,1,0)]
Finally we use the inf_term and induceabort to correct or compleate some observations
data[!is.na(data$inf_term),c("miscarriage","loss","loss_etiology","birth","bdeath"):= list(0,0,0,1,0)]
data[inducedabort==1&birth==1,c("miscarriage","loss","loss_etiology","birth","bdeath"):= list(0,1,2,0,1)]
We combine the information of all the pregnancy term variables and check if there are observations with inconsistente relationships and remove temporal variables not need it in the final dataset.
#Coherent with inf_alive_birth 0=alive, induce abort 0=No
checktable<-data.table(table(miscarriage=data$miscarriage,
loss=data$loss,
etiology=data$loss_etiology,
birth=data$birth,
bdeath=data$bdeath,
vstatus=data$inf_vital_status,
abort=data$inducedabort,
infterm=!is.na(data$inf_term), useNA="always")) #V1 is miscarriage, V2 is loss and N the number of observations
checktable[N!=0,]
## miscarriage loss etiology birth bdeath vstatus abort infterm N
## 1: 0 0 0 1 0 0 0 FALSE 46
## 2: 0 0 0 1 0 <NA> 0 FALSE 217
## 3: 1 0 1 0 1 <NA> 0 FALSE 36
## 4: 0 1 2 0 1 <NA> 0 FALSE 11
## 5: 0 0 <NA> <NA> <NA> <NA> 0 FALSE 8
## 6: 0 <NA> <NA> <NA> <NA> <NA> 0 FALSE 6
## 7: 1 0 1 0 1 <NA> 1 FALSE 2
## 8: 0 1 2 0 1 <NA> 1 FALSE 14
## 9: 0 0 0 1 0 <NA> <NA> FALSE 3
## 10: 1 0 1 0 1 <NA> <NA> FALSE 1
## 11: 0 1 2 0 1 <NA> <NA> FALSE 17
## 12: <NA> <NA> <NA> <NA> <NA> <NA> <NA> FALSE 58
## 13: 0 0 0 1 0 0 0 TRUE 1311
## 14: 0 0 0 1 0 1 0 TRUE 4
## 15: 0 0 0 1 0 2 0 TRUE 3
## 16: 0 0 0 1 0 <NA> 0 TRUE 91
## 17: 0 1 2 0 1 <NA> 1 TRUE 3
## 18: 0 0 0 1 0 <NA> <NA> TRUE 52
data[,birth:=NULL]
data[,birth2:=NULL]
col1<-c("med_bin","med_anticonvuls_bin","med_preg_bin","med_fertil_bin")
data[,drugs_prescr:=checkcon(data=data,col1=col1)]
col1<-c("vac_rub_enroll","vac_vari_enroll","vac_yf_enroll")
data[,vaccination:=checkcon(data=data,col1=col1)]
data[,modificated1:=ifelse(storch==0,0,ifelse(!is.na(storch),1,0))]
col1<-c("modificated1","storch_bin","toxo","toxo_treat","syphilis","syphilis_treat","varicella","parvo","rubella","cmv","herpes","listeria","chlamydia","gonorrhea","genitalwarts")
data[,storch_patho:=checkcon(data=data,col1=col1)]
data[,modificated1:=ifelse(arb_clindiag_plus==0,0,ifelse(!is.na(arb_clindiag_plus),1,0))]
data[,modificated2:=ifelse(arb_clindiag!=0&arb_clindiag!=1,0,ifelse(!is.na(arb_clindiag),1,0))] #arb_clindiag==777 to 6 on top
col1<-c("modificated1","modificated2","denv_ever","chikv_ever")
data[,flavi_alpha_virus:=checkcon(data=data,col1=col1)]
data[,modificated1:=ifelse(zikv_pcr_everpos==1,1,ifelse(zikv_pcr_everpos==0,0,NA))] # 2 indeterminated
col1<-c("modificated1","zikv_elisa_everpos","denv_ever","chikv_ever")
data[,arb_ever:=checkcon(data=data,col1=col1)]
data[,modificated1:=ifelse(zikv_pcr_res_1==1,1,ifelse(zikv_pcr_res_1==0,0,NA))] # 2 indeterminated
data[,modificated2:=ifelse(zikv_elisa_res_1==1,1,ifelse(zikv_elisa_res_1==0,0,NA))]
data[,modificated3:=ifelse(arb_clindiag==0,0,ifelse(!is.na(arb_clindiag),1,NA))] #arb_clindiag==777 to 6 modified at the beginning= other arbovirus
col1<-c("modificated1","modificated2","modificated3","zikv_preg","denv_preg","chikv_preg")
data[,arb_preg:=checkcon(data=data,col1=col1)]
data[,modificated1:=ifelse(arb_clindiag==0|arb_clindiag==1,0,ifelse(!is.na(arb_clindiag),1,NA))] #arb_clindiag==777 to 6 modified at the beginning= other arbovirus
col1<-c("modificated1","denv_preg","chikv_preg")
data[,arb_preg_nz:=checkcon(data=data,col1=col1)]
Van Buuren proposed the influx and outflux values as selection criteria of predictor variables. The influx is a measure of how the missing observations of a variable is connected with observed data of other variables. Whereas the outflux value is provide an estimation of how the observed data of a variable is connected with incomplete data of the other variables. The outflux parameter is an indicator of potential power of a variable for imputing other variables. In this case we define that a variable could be a good predictor if it outflux value is equal or superior to 50% .i.e. could help to predict more that the 50% of variables with incomplete data.
fx<-flux(data)
fluxplot(data,eqscplot=TRUE)
outlist<-row.names(fx)[fx$outflux>=0.5]
sort(outlist)
## [1] "age" "anyabnormality_czs" "arb_ever"
## [4] "arb_preg" "arb_symp" "bdeath"
## [7] "bdeath_ga" "birth_ga" "conjunctivitis"
## [10] "endga" "fever" "gravidity"
## [13] "hiv" "hydrocephaly" "inducedabort"
## [16] "inf_head_circ_birth" "inf_sex" "inf_term"
## [19] "inf_vital_status" "inf_weight" "loss"
## [22] "loss_etiology" "microcephaly" "microcephaly_bin"
## [25] "mid_original" "miscarriage" "multiplegest"
## [28] "neuroabnormality" "parity" "rash"
## [31] "storch" "storch_bin" "storch_patho"
## [34] "studyname" "symp_tri" "ventriculomegaly"
## [37] "zikv_gan" "zikv_pcr_date_1" "zikv_pcr_everpos"
## [40] "zikv_pcr_ga_1" "zikv_pcr_res_1" "zikv_pcr_tri_1"
## [43] "zikv_preg" "zikv_trin"
We calculate the proportion of missingess of each varable on each study and place it in a matrix format that allow us to plotit.
nindex1 <- nindex[order(order),][!is.na(order)] # order according categories: pregnancy women, exposure, outcome
pldata<-data[, .SD, .SDcols = nindex1$names]
dmatrix<-pldata[, lapply(.SD, function(x) sum(is.na(x))/.N), studyname] #calulate % missingness
dmatrix2<-melt(dmatrix,id.vars="studyname") # change study name to display
dmatrix2[,name:=fcase(
studyname=="Brazil_RiodeJaneiro_CunhaPrata","Brazil\nCunhaPrata",
studyname=="Brazil_SP_RibeiraoPreto_Duarte","Brazil\n Duarte",
studyname=="Brazil_BahiaPaudaLima_Costa","Brazil\nCosta",
studyname=="Spain_Soriano" ,"Spain\nSoriano",
studyname=="FrenchGuiana_Pomar" ,"Fr.Guiana\nPomar",
studyname=="TrinidadTobago_Sohan" ,"Tri.Tobago\nSohan",
studyname=="Brazil_RiodeJaneiro_Joao" ,"Brazil\nJoao",
studyname=="Spain_Bardaji" ,"Spain\nBardaji",
studyname=="Colombia_Mulkey" ,"Colombia\nMulkey",
studyname=="USA_Mulkey" ,"USA\nMulkey"
)]
dmatrix2[,miss:=round(value*100,1)]
dmatrix2[,text:=paste0("study: ", studyname, "\n", "variable: ", variable, "\n", "miss%: ",miss)] # set the text displayed by place the cursor over especific observation.
The ploty package allows user to zoom out sections of the plot, by placing the cursor over an observation is diplayed the % missing by study.
names<-nindex1$names
color<-ifelse(nindex1$Mod_var == 1, "red", "black")
nindex1$color<-color
p<-ggplot(dmatrix2, aes(x=name, y=variable,fill=miss,text=text)) +
geom_tile() +
scale_fill_gradientn(colours=c("green","yellow","red")) +
theme(axis.text.x = element_text(angle = 45,vjust=0.5,size=8),
axis.text.y = element_text(size=8))+xlab("Study name")+ylab("Variable")+
labs(fill='%missing')
ggplotly(p, tooltip="text")
We subset the dataset with the selected variables.
nindexf<-nindex[Final==1,]
fdata<-data[, .SD, .SDcols = nindexf$names]
summary(fdata)
## inducedabort bdeath bdeath_ga hydrocephaly
## Min. :0.00000 Min. :0.00000 Min. : 7.00 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:38.00 1st Qu.:0.0000
## Median :0.00000 Median :0.00000 Median :39.00 Median :0.0000
## Mean :0.01084 Mean :0.04638 Mean :38.13 Mean :0.0035
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:40.00 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.00000 Max. :47.30 Max. :1.0000
## NA's :131 NA's :72 NA's :656 NA's :741
## corticalatrophy calcifications ventriculomegaly othabnorm
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.0195 Mean :0.0229 Mean :0.0413 Mean :0.0704
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## NA's :1010 NA's :1011 NA's :745 NA's :1215
## microcephaly neuroabnormality anyabnormality_czs czsn
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.1211 Mean :0.1088 Mean :0.1616 Mean :0.0464
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :3.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## NA's :521 NA's :679 NA's :621 NA's :957
## igr_curr_preg multiplegest inf_sex inf_weight
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. : 100
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:2868
## Median :0.0000 Median :0.0000 Median :0.0000 Median :3175
## Mean :0.0447 Mean :0.0202 Mean :0.4745 Mean :3138
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:3500
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :7000
## NA's :1123 NA's :299 NA's :686 NA's :500
## inf_length inf_head_circ_birth zikv_preg zikv_gan
## Min. : 4.00 Min. :15.00 Min. :0.0000 Min. : 2.00
## 1st Qu.:47.50 1st Qu.:33.00 1st Qu.:0.0000 1st Qu.: 15.14
## Median :49.00 Median :34.00 Median :1.0000 Median : 23.43
## Mean :48.58 Mean :33.76 Mean :0.7383 Mean : 23.24
## 3rd Qu.:50.00 3rd Qu.:35.00 3rd Qu.:1.0000 3rd Qu.: 31.00
## Max. :54.00 Max. :47.50 Max. :1.0000 Max. :111.14
## NA's :956 NA's :569 NA's :137 NA's :850
## fever rash conjunctivitis arb_symp
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :1.0000
## Mean :0.2336 Mean :0.4934 Mean :0.1616 Mean :0.5791
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## NA's :329 NA's :286 NA's :336 NA's :403
## arb_ever arb_preg_nz age gravidity
## Min. :0.0000 Min. :0.0000 Min. :14.00 Min. : 0.000
## 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:23.00 1st Qu.: 1.000
## Median :1.0000 Median :0.0000 Median :28.00 Median : 2.000
## Mean :0.8925 Mean :0.0996 Mean :27.74 Mean : 2.528
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:32.00 3rd Qu.: 3.000
## Max. :1.0000 Max. :1.0000 Max. :49.00 Max. :14.000
## NA's :69 NA's :869 NA's :546 NA's :665
## parity hiv storch_patho studyname
## Min. : 0.000 Min. :0.0000 Min. :0.0000 Length:1883
## 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.:0.0000 Class :character
## Median : 1.000 Median :0.0000 Median :0.0000 Mode :character
## Mean : 1.673 Mean :0.1728 Mean :0.4404
## 3rd Qu.: 2.000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :12.000 Max. :1.0000 Max. :1.0000
## NA's :681 NA's :581 NA's :675
Then we specify vector for each type of variable
ncon<-c("age","bdeath_ga","zikv_gan","inf_weight","inf_head_circ_birth","inf_length") # name of continuous variables
nncon<-colnames(data)[!colnames(data)%in%ncon] # name of no continuous variables
ncat<-c("gravidity","parity","microcephaly") #name of categorical variables
nbin<-nncon[!nncon%in%ncat] #name of binary variables
We initialize a mice object to later on set the methods, prediction and post arguments
ini <- mice(data, maxit = 0)
## Warning: Number of logged events: 42
meth <- ini$meth
pred <- ini$pred
post <- ini$post
In general we assigned clusterized methods for all variables, however we set marginalized methods for those variables which their distribution does not vary across studies or that result in convergency problems.
meth[ncon] <- "2l.norm" # normal distribution at study level
meth[c("inf_weight","inf_head_circ_birth","inf_length","zikv_gan")]<-"norm" # Due to convergence problems imputation based on marginal distribution
meth[ncat] <- "2l.pmm" # predictive mean matching at study level
meth[c("na_b")] <- "2l.pmm" # set with ppm at study level
meth[nbin] <- "2l.bin"# binomial distribution at study level
meth[c("corticalatrophy","ventriculomegaly","multiplegest","othabnorm","calcifications")] <- "logreg" # Due to frequent singular fit warning imputation based on marginal distribution
To construct the prediction matrix, we used as predictors all variables that have a Pearson correlation with the imputed variable of at least 0.3. In addition, we specify the studyname as a cluster indicator.
pred <- quickpred(data, mincor=0.3) # assignation based on pairwise correlaition
pred[, c("bdeath_ga")] <- 0 # as information of prediction is given by nelson allen variable na_b
pred[,"studyname"] <- -2 # define the cluster for imputation models at study level.
pred["studyname", "studyname"] <- 0
For all continuous variables, we added a post-processing step that censored all imputed values outside the observable range of each variable.
post["age"] <- "imp[[j]][, i] <- squeeze(imp[[j]][, i], c(14, 50))"
post["bdeath_ga"] <- "imp[[j]][, i] <- squeeze(imp[[j]][, i], c(0, 50))"
post["zikv_gan"] <- "imp[[j]][, i] <- squeeze(imp[[j]][, i], c(0, 65))"
post["inf_weight"] <- "imp[[j]][, i] <- squeeze(imp[[j]][, i], c(100, 7000))"
post["inf_head_circ_birth"] <- "imp[[j]][, i] <- squeeze(imp[[j]][, i], c(15, 50))"
post["inf_length"] <- "imp[[j]][, i] <- squeeze(imp[[j]][, i], c(18, 60))"
Finally we created the mice object, running 20 iterations for each 10 imputations.
micesurv <- mice(data, predictorMatrix = pred,method=meth,post=post, maxit =20, m = 10)
Due to the exhaustive processing time, we run the simulations in a high performance computer, all the codes are save it in the subfolder codes.